The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
As the basis for prehensile manipulation, it is vital to enable robots to grasp as robustly as humans. In daily manipulation, our grasping system is prompt, accurate, flexible and continuous across spatial and temporal domains. Few existing methods cover all these properties for robot grasping. In this paper, we propose a new methodology for grasp perception to enable robots these abilities. Specifically, we develop a dense supervision strategy with real perception and analytic labels in the spatial-temporal domain. Additional awareness of objects' center-of-mass is incorporated into the learning process to help improve grasping stability. Utilization of grasp correspondence across observations enables dynamic grasp tracking. Our model, AnyGrasp, can generate accurate, full-DoF, dense and temporally-smooth grasp poses efficiently, and works robustly against large depth sensing noise. Embedded with AnyGrasp, we achieve a 93.3% success rate when clearing bins with over 300 unseen objects, which is comparable with human subjects under controlled conditions. Over 900 MPPH is reported on a single-arm system. For dynamic grasping, we demonstrate catching swimming robot fish in the water.
translated by 谷歌翻译
Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model. Traditional knowledge distillation methods include response-based methods and feature-based methods. Response-based methods are used the most widely but suffer from lower upper limit of model performance, while feature-based methods have constraints on the vocabularies and tokenizers. In this paper, we propose a tokenizer-free method liberal feature-based distillation (LEAD). LEAD aligns the distribution between teacher model and student model, which is effective, extendable, portable and has no requirements on vocabularies, tokenizer, or model architecture. Extensive experiments show the effectiveness of LEAD on several widely-used benchmarks, including MS MARCO Passage, TREC Passage 19, TREC Passage 20, MS MARCO Document, TREC Document 19 and TREC Document 20.
translated by 谷歌翻译
Automatic parsing of human anatomies at instance-level from 3D computed tomography (CT) scans is a prerequisite step for many clinical applications. The presence of pathologies, broken structures or limited field-of-view (FOV) all can make anatomy parsing algorithms vulnerable. In this work, we explore how to exploit and conduct the prosperous detection-then-segmentation paradigm in 3D medical data, and propose a steerable, robust, and efficient computing framework for detection, identification, and segmentation of anatomies in CT scans. Considering complicated shapes, sizes and orientations of anatomies, without lose of generality, we present the nine degrees-of-freedom (9-DoF) pose estimation solution in full 3D space using a novel single-stage, non-hierarchical forward representation. Our whole framework is executed in a steerable manner where any anatomy of interest can be directly retrieved to further boost the inference efficiency. We have validated the proposed method on three medical imaging parsing tasks of ribs, spine, and abdominal organs. For rib parsing, CT scans have been annotated at the rib instance-level for quantitative evaluation, similarly for spine vertebrae and abdominal organs. Extensive experiments on 9-DoF box detection and rib instance segmentation demonstrate the effectiveness of our framework (with the identification rate of 97.0% and the segmentation Dice score of 90.9%) in high efficiency, compared favorably against several strong baselines (e.g., CenterNet, FCOS, and nnU-Net). For spine identification and segmentation, our method achieves a new state-of-the-art result on the public CTSpine1K dataset. Last, we report highly competitive results in multi-organ segmentation at FLARE22 competition. Our annotations, code and models will be made publicly available at: https://github.com/alibaba-damo-academy/Med_Query.
translated by 谷歌翻译
Jacobian and Hessian regularization aim to reduce the magnitude of the first and second-order partial derivatives with respect to neural network inputs, and they are predominantly used to ensure the adversarial robustness of image classifiers. In this work, we generalize previous efforts by extending the target matrix from zero to any matrix that admits efficient matrix-vector products. The proposed paradigm allows us to construct novel regularization terms that enforce symmetry or diagonality on square Jacobian and Hessian matrices. On the other hand, the major challenge for Jacobian and Hessian regularization has been high computational complexity. We introduce Lanczos-based spectral norm minimization to tackle this difficulty. This technique uses a parallelized implementation of the Lanczos algorithm and is capable of effective and stable regularization of large Jacobian and Hessian matrices. Theoretical justifications and empirical evidence are provided for the proposed paradigm and technique. We carry out exploratory experiments to validate the effectiveness of our novel regularization terms. We also conduct comparative experiments to evaluate Lanczos-based spectral norm minimization against prior methods. Results show that the proposed methodologies are advantageous for a wide range of tasks.
translated by 谷歌翻译
Deep learning has enabled realistic face manipulation (i.e., deepfake), which poses significant concerns over the integrity of the media in circulation. Most existing deep learning techniques for deepfake detection can achieve promising performance in the intra-dataset evaluation setting (i.e., training and testing on the same dataset), but are unable to perform satisfactorily in the inter-dataset evaluation setting (i.e., training on one dataset and testing on another). Most of the previous methods use the backbone network to extract global features for making predictions and only employ binary supervision (i.e., indicating whether the training instances are fake or authentic) to train the network. Classification merely based on the learning of global features leads often leads to weak generalizability to unseen manipulation methods. In addition, the reconstruction task can improve the learned representations. In this paper, we introduce a novel approach for deepfake detection, which considers the reconstruction and classification tasks simultaneously to address these problems. This method shares the information learned by one task with the other, which focuses on a different aspect other existing works rarely consider and hence boosts the overall performance. In particular, we design a two-branch Convolutional AutoEncoder (CAE), in which the Convolutional Encoder used to compress the feature map into the latent representation is shared by both branches. Then the latent representation of the input data is fed to a simple classifier and the unsupervised reconstruction component simultaneously. Our network is trained end-to-end. Experiments demonstrate that our method achieves state-of-the-art performance on three commonly-used datasets, particularly in the cross-dataset evaluation setting.
translated by 谷歌翻译
Multi-modal fusion is a basic task of autonomous driving system perception, which has attracted many scholars' interest in recent years. The current multi-modal fusion methods mainly focus on camera data and LiDAR data, but pay little attention to the kinematic information provided by the bottom sensors of the vehicle, such as acceleration, vehicle speed, angle of rotation. These information are not affected by complex external scenes, so it is more robust and reliable. In this paper, we introduce the existing application fields of vehicle bottom information and the research progress of related methods, as well as the multi-modal fusion methods based on bottom information. We also introduced the relevant information of the vehicle bottom information data set in detail to facilitate the research as soon as possible. In addition, new future ideas of multi-modal fusion technology for autonomous driving tasks are proposed to promote the further utilization of vehicle bottom information.
translated by 谷歌翻译
Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.
translated by 谷歌翻译
Federated learning (FL) enables the building of robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without centralizing the data. We created NVIDIA FLARE as an open-source software development kit (SDK) to make it easier for data scientists to use FL in their research and real-world applications. The SDK includes solutions for state-of-the-art FL algorithms and federated machine learning approaches, which facilitate building workflows for distributed learning across enterprises and enable platform developers to create a secure, privacy-preserving offering for multiparty collaboration utilizing homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package, and allows researchers to bring their data science workflows implemented in any training libraries (PyTorch, TensorFlow, XGBoost, or even NumPy) and apply them in real-world FL settings. This paper introduces the key design principles of FLARE and illustrates some use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms. Code is available at https://github.com/NVIDIA/NVFlare.
translated by 谷歌翻译
Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby including more task-relevant information to boost effectiveness. Then we propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to conduct the action recognition task without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.
translated by 谷歌翻译